B-76 Technologies, a provider of high-end software services to customers in a variety of sectors, stores and manages confidential client data. Due to an upsurge in cyber attacks, B-76 Technologies is concerned about its network security. To address this concern, the company wants to develop a Network Intrusion Detection System (NIDS) that monitors network traffic for unusual activity and raises alerts when such activity is discovered. A NIDS is essential for recognizing and preventing cyber attacks, safeguarding data integrity and confidentiality.
As the Data Scientist appointed by the company, my objective is to develop a NIDS using machine learning algorithms that can detect and help prevent network intrusions. The steps involved in developing the NIDS are:
The dataset required for developing the NIDS was obtained from Kaggle, at the following hyperlink:
https://www.kaggle.com/datasets/mostafaomar2372/nf-unsw-nb15
!pip install imblearn
!pip install category_encoders
Successfully installed imblearn-0.0
Successfully installed category_encoders-2.6.1
To access pre-existing modules and packages, the required libraries must first be imported into the programming environment.
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
import sklearn.preprocessing
import category_encoders
import imblearn.over_sampling
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import sklearn.metrics
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from xgboost import XGBClassifier
from sklearn.metrics import classification_report
import plotly.graph_objects as go
from plotly.subplots import make_subplots
The data required for the NIDS is loaded and stored as a DataFrame for analysis. The head() function previews the first five rows of the dataset, providing an overview of its structure.
unsw_df = pd.read_csv("NF-UNSW-NB15.csv")
unsw_df.head()
|   | Unnamed: 0 | IPV4_SRC_ADDR | L4_SRC_PORT | IPV4_DST_ADDR | L4_DST_PORT | PROTOCOL | L7_PROTO | IN_BYTES | IN_PKTS | OUT_BYTES | ... | TCP_WIN_MAX_IN | TCP_WIN_MAX_OUT | ICMP_TYPE | ICMP_IPV4_TYPE | DNS_QUERY_ID | DNS_QUERY_TYPE | DNS_TTL_ANSWER | FTP_COMMAND_RET_CODE | Label | Attack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 975176 | 59.166.0.9 | 30659 | 149.171.126.5 | 53 | 17 | 0.0 | 146 | 2 | 178 | ... | 0 | 0 | 0 | 0 | 53862 | 1 | 60 | 0.0 | 0 | Benign |
| 1 | 1475060 | 59.166.0.2 | 41056 | 149.171.126.3 | 64665 | 6 | 0.0 | 320 | 6 | 1902 | ... | 7240 | 5792 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | Benign |
| 2 | 2149826 | 59.166.0.7 | 1867 | 149.171.126.9 | 53 | 17 | 0.0 | 146 | 2 | 178 | ... | 0 | 0 | 0 | 0 | 41710 | 1 | 60 | 0.0 | 0 | Benign |
| 3 | 931632 | 59.166.0.6 | 1235 | 149.171.126.5 | 31940 | 6 | 0.0 | 2230 | 34 | 15258 | ... | 20272 | 14480 | 6912 | 27 | 0 | 0 | 0 | 0.0 | 0 | Benign |
| 4 | 1614143 | 59.166.0.0 | 26575 | 149.171.126.2 | 21 | 6 | 1.0 | 2059 | 37 | 2816 | ... | 21720 | 18824 | 63744 | 249 | 0 | 0 | 0 | 125.0 | 0 | Benign |
5 rows × 46 columns
The UNSW-NB15 dataset, developed by the University of New South Wales, is a comprehensive dataset designed for network intrusion detection systems. It comprises network packets captured using the IXIA PerfectStorm tool within the UNSW Canberra Cyber Range Lab, combining real modern normal activities with synthetic contemporary attack behaviours. The subset used here consists of 25,500 instances and 46 columns, covering a diverse range of features extracted from network traffic data. Below is a list of the features in the dataset, encompassing source and destination IP addresses, port numbers, flow duration and various other attributes.
df = pd.read_excel("NF-UNSWNB15-Features.xlsx")
df.head(45)
|   | FEATURES | DESCRIPTION |
|---|---|---|
| 0 | IPV4_SRC_ADDR | IPv4 source address |
| 1 | IPV4_DST_ADDR | IPv4 destination address |
| 2 | L4_SRC_PORT | IPv4 source port number |
| 3 | L4_DST_PORT | IPv4 destination port number |
| 4 | PROTOCOL | IP protocol identifier byte |
| 5 | L7_PROTO | Layer 7 protocol (numeric) |
| 6 | IN_BYTES | Incoming number of bytes |
| 7 | OUT_BYTES | Outgoing number of bytes |
| 8 | IN_PKTS | Incoming number of packets |
| 9 | OUT_PKTS | Outgoing number of packets |
| 10 | FLOW_DURATION_MILLISECONDS | Flow duration in milliseconds |
| 11 | TCP_FLAGS | Cumulative of all TCP flags |
| 12 | CLIENT_TCP_FLAGS | Cumulative of all client TCP flags |
| 13 | SERVER_TCP_FLAGS | Cumulative of all server TCP flags |
| 14 | DURATION_IN | Client to Server stream duration (msec) |
| 15 | DURATION_OUT | Server to Client stream duration (msec) |
| 16 | MIN_TTL | Min flow TTL |
| 17 | MAX_TTL | Max flow TTL |
| 18 | LONGEST_FLOW_PKT | Longest packet (bytes) of the flow |
| 19 | SHORTEST_FLOW_PKT | Shortest packet (bytes) of the flow |
| 20 | MIN_IP_PKT_LEN | Len of the smallest flow IP packet observed |
| 21 | MAX_IP_PKT_LEN | Len of the largest flow IP packet observed |
| 22 | SRC_TO_DST_SECOND_BYTES | Src to dst Bytes/sec |
| 23 | DST_TO_SRC_SECOND_BYTES | Dst to src Bytes/sec |
| 24 | RETRANSMITTED_IN_BYTES | Number of retransmitted TCP flow bytes (src->dst) |
| 25 | RETRANSMITTED_IN_PKTS | Number of retransmitted TCP flow packets (src->dst) |
| 26 | RETRANSMITTED_OUT_BYTES | Number of retransmitted TCP flow bytes (dst->src) |
| 27 | RETRANSMITTED_OUT_PKTS | Number of retransmitted TCP flow packets (dst->src) |
| 28 | SRC_TO_DST_AVG_THROUGHPUT | Src to dst average thpt (bps) |
| 29 | DST_TO_SRC_AVG_THROUGHPUT | Dst to src average thpt (bps) |
| 30 | NUM_PKTS_UP_TO_128_BYTES | Packets whose IP size <= 128 |
| 31 | NUM_PKTS_128_TO_256_BYTES | Packets whose IP size > 128 and <= 256 |
| 32 | NUM_PKTS_256_TO_512_BYTES | Packets whose IP size > 256 and <= 512 |
| 33 | NUM_PKTS_512_TO_1024_BYTES | Packets whose IP size > 512 and <= 1024 |
| 34 | NUM_PKTS_1024_TO_1514_BYTES | Packets whose IP size > 1024 and <= 1514 |
| 35 | TCP_WIN_MAX_IN | Max TCP Window (src->dst) |
| 36 | TCP_WIN_MAX_OUT | Max TCP Window (dst->src) |
| 37 | ICMP_TYPE | ICMP Type * 256 + ICMP code |
| 38 | ICMP_IPV4_TYPE | ICMP Type |
| 39 | DNS_QUERY_ID | DNS query transaction Id |
| 40 | DNS_QUERY_TYPE | DNS query type (e.g. 1=A, 2=NS..) |
| 41 | DNS_TTL_ANSWER | TTL of the first A record (if any) |
| 42 | FTP_COMMAND_RET_CODE | FTP client command return code |
| 43 | LABEL | Indicates the network traffic: 0 for Normal/Benign, 1 for Attack |
| 44 | ATTACK | This column represents the type of "Attack" |
Data exploration is a crucial step in machine learning that helps us analyze the data and gain valuable insights from it. It provides an overview of the data structure, including column names, data types and missing values.
unsw_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25500 entries, 0 to 25499
Data columns (total 46 columns):
 #   Column                       Non-Null Count  Dtype
---  ------                       --------------  -----
 0   Unnamed: 0                   25500 non-null  int64
 1   IPV4_SRC_ADDR                25500 non-null  object
 2   L4_SRC_PORT                  25500 non-null  int64
 3   IPV4_DST_ADDR                25500 non-null  object
 4   L4_DST_PORT                  25500 non-null  int64
 5   PROTOCOL                     25500 non-null  int64
 6   L7_PROTO                     25500 non-null  float64
 7   IN_BYTES                     25500 non-null  int64
 8   IN_PKTS                      25500 non-null  int64
 9   OUT_BYTES                    25500 non-null  int64
 10  OUT_PKTS                     25500 non-null  int64
 11  TCP_FLAGS                    25500 non-null  int64
 12  CLIENT_TCP_FLAGS             25500 non-null  int64
 13  SERVER_TCP_FLAGS             25500 non-null  int64
 14  FLOW_DURATION_MILLISECONDS   25500 non-null  int64
 15  DURATION_IN                  25500 non-null  int64
 16  DURATION_OUT                 25500 non-null  int64
 17  MIN_TTL                      25500 non-null  int64
 18  MAX_TTL                      25500 non-null  int64
 19  LONGEST_FLOW_PKT             25500 non-null  int64
 20  SHORTEST_FLOW_PKT            25500 non-null  int64
 21  MIN_IP_PKT_LEN               25500 non-null  int64
 22  MAX_IP_PKT_LEN               25500 non-null  int64
 23  SRC_TO_DST_SECOND_BYTES      25500 non-null  float64
 24  DST_TO_SRC_SECOND_BYTES      25500 non-null  float64
 25  RETRANSMITTED_IN_BYTES       25500 non-null  int64
 26  RETRANSMITTED_IN_PKTS        25500 non-null  int64
 27  RETRANSMITTED_OUT_BYTES      25500 non-null  int64
 28  RETRANSMITTED_OUT_PKTS       25500 non-null  int64
 29  SRC_TO_DST_AVG_THROUGHPUT    25500 non-null  int64
 30  DST_TO_SRC_AVG_THROUGHPUT    25500 non-null  int64
 31  NUM_PKTS_UP_TO_128_BYTES     25500 non-null  int64
 32  NUM_PKTS_128_TO_256_BYTES    25500 non-null  int64
 33  NUM_PKTS_256_TO_512_BYTES    25500 non-null  int64
 34  NUM_PKTS_512_TO_1024_BYTES   25500 non-null  int64
 35  NUM_PKTS_1024_TO_1514_BYTES  25500 non-null  int64
 36  TCP_WIN_MAX_IN               25500 non-null  int64
 37  TCP_WIN_MAX_OUT              25500 non-null  int64
 38  ICMP_TYPE                    25500 non-null  int64
 39  ICMP_IPV4_TYPE               25500 non-null  int64
 40  DNS_QUERY_ID                 25500 non-null  int64
 41  DNS_QUERY_TYPE               25500 non-null  int64
 42  DNS_TTL_ANSWER               25500 non-null  int64
 43  FTP_COMMAND_RET_CODE         25500 non-null  float64
 44  Label                        25500 non-null  int64
 45  Attack                       25500 non-null  object
dtypes: float64(4), int64(39), object(3)
memory usage: 8.9+ MB
Dropping unwanted columns and converting the IPV4_SRC_ADDR & IPV4_DST_ADDR columns to integers
The column "Unnamed: 0" is not important for the analysis, so it is dropped from the DataFrame. During exploration it was found that the columns "IPV4_SRC_ADDR" and "IPV4_DST_ADDR" are represented as strings. Hence, in the following steps they are converted to integers by removing the dots.
unsw_df.drop(["Unnamed: 0"], axis =1, inplace = True)
unsw_df["IPV4_SRC_ADDR"] = unsw_df["IPV4_SRC_ADDR"].str.replace(".", "", regex=False).astype(int)
unsw_df["IPV4_DST_ADDR"] = unsw_df["IPV4_DST_ADDR"].str.replace(".", "", regex=False).astype(int)
unsw_df.head()
|   | IPV4_SRC_ADDR | L4_SRC_PORT | IPV4_DST_ADDR | L4_DST_PORT | PROTOCOL | L7_PROTO | IN_BYTES | IN_PKTS | OUT_BYTES | OUT_PKTS | ... | TCP_WIN_MAX_IN | TCP_WIN_MAX_OUT | ICMP_TYPE | ICMP_IPV4_TYPE | DNS_QUERY_ID | DNS_QUERY_TYPE | DNS_TTL_ANSWER | FTP_COMMAND_RET_CODE | Label | Attack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5916609 | 30659 | 1491711265 | 53 | 17 | 0.0 | 146 | 2 | 178 | 2 | ... | 0 | 0 | 0 | 0 | 53862 | 1 | 60 | 0.0 | 0 | Benign |
| 1 | 5916602 | 41056 | 1491711263 | 64665 | 6 | 0.0 | 320 | 6 | 1902 | 8 | ... | 7240 | 5792 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0 | Benign |
| 2 | 5916607 | 1867 | 1491711269 | 53 | 17 | 0.0 | 146 | 2 | 178 | 2 | ... | 0 | 0 | 0 | 0 | 41710 | 1 | 60 | 0.0 | 0 | Benign |
| 3 | 5916606 | 1235 | 1491711265 | 31940 | 6 | 0.0 | 2230 | 34 | 15258 | 36 | ... | 20272 | 14480 | 6912 | 27 | 0 | 0 | 0 | 0.0 | 0 | Benign |
| 4 | 5916600 | 26575 | 1491711262 | 21 | 6 | 1.0 | 2059 | 37 | 2816 | 39 | ... | 21720 | 18824 | 63744 | 249 | 0 | 0 | 0 | 125.0 | 0 | Benign |
5 rows × 45 columns
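Note that stripping the dots is lossy: two different addresses can collapse to the same integer once the separators are gone. A lossless alternative, shown here only as a sketch (it is not what the notebook above uses), interprets the four octets as a single 32-bit integer:

```python
def ipv4_to_int(addr: str) -> int:
    # Interpret the four octets as a 32-bit big-endian integer, so that
    # every IPv4 address maps to a unique value.
    a, b, c, d = (int(part) for part in addr.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

# Dot-stripping maps both of these addresses to the string "12345",
# whereas the 32-bit encoding keeps them distinct.
print(ipv4_to_int("1.23.4.5"))   # 18285573
print(ipv4_to_int("12.3.4.5"))   # 201524229
```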
While examining the dataset, it was observed that the vast majority of instances are labelled 0, with the rest labelled 1, and that the rows are ordered rather than randomly arranged. This ordering may introduce bias in the training and evaluation process. To address this, the dataset was shuffled so that the labelled categories are randomly distributed across the rows.
unsw_df = unsw_df.sample(frac=1).reset_index(drop= True)
unsw_df.head(15)
|   | IPV4_SRC_ADDR | L4_SRC_PORT | IPV4_DST_ADDR | L4_DST_PORT | PROTOCOL | L7_PROTO | IN_BYTES | IN_PKTS | OUT_BYTES | OUT_PKTS | ... | TCP_WIN_MAX_IN | TCP_WIN_MAX_OUT | ICMP_TYPE | ICMP_IPV4_TYPE | DNS_QUERY_ID | DNS_QUERY_TYPE | DNS_TTL_ANSWER | FTP_COMMAND_RET_CODE | Label | Attack |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5916601 | 54210 | 1491711269 | 21 | 6 | 1.0 | 2059 | 37 | 2814 | 39 | ... | 21720 | 18824 | 63744 | 249 | 0 | 0 | 0 | 125.0 | 0 | Benign |
| 1 | 5916600 | 56539 | 1491711265 | 53 | 17 | 0.0 | 146 | 2 | 178 | 2 | ... | 0 | 0 | 0 | 0 | 31102 | 1 | 60 | 0.0 | 0 | Benign |
| 2 | 5916609 | 59133 | 1491711269 | 80 | 6 | 7.0 | 1044 | 8 | 824 | 10 | ... | 7240 | 7240 | 27136 | 106 | 0 | 0 | 0 | 0.0 | 0 | Benign |
| 3 | 5916608 | 26923 | 1491711265 | 53 | 17 | 0.0 | 146 | 2 | 178 | 2 | ... | 0 | 0 | 0 | 0 | 52366 | 1 | 60 | 0.0 | 0 | Benign |
| 4 | 5916608 | 53655 | 1491711264 | 21 | 6 | 1.0 | 481 | 9 | 750 | 11 | ... | 10136 | 10136 | 33792 | 132 | 0 | 0 | 0 | 229.0 | 0 | Benign |
| 5 | 5916604 | 57919 | 1491711265 | 5190 | 6 | 0.0 | 1470 | 22 | 1728 | 14 | ... | 10136 | 11584 | 28416 | 111 | 0 | 0 | 0 | 0.0 | 0 | Benign |
| 6 | 5916609 | 61832 | 1491711263 | 18099 | 6 | 0.0 | 2766 | 44 | 27770 | 46 | ... | 27512 | 14480 | 8960 | 35 | 0 | 0 | 0 | 0.0 | 0 | Benign |
| 7 | 175451763 | 35333 | 14917112611 | 80 | 6 | 7.0 | 1270 | 12 | 5082 | 12 | ... | 16383 | 16383 | 51968 | 203 | 0 | 0 | 0 | 0.0 | 0 | Benign |
| 8 | 5916605 | 37915 | 1491711267 | 21 | 6 | 1.0 | 1817 | 33 | 2510 | 35 | ... | 20272 | 17376 | 46080 | 180 | 0 | 0 | 0 | 229.0 | 0 | Benign |
| 9 | 5916603 | 57728 | 1491711268 | 21 | 6 | 1.0 | 481 | 9 | 750 | 11 | ... | 10136 | 10136 | 33792 | 132 | 0 | 0 | 0 | 229.0 | 0 | Benign |
| 10 | 5916604 | 30017 | 1491711269 | 53 | 17 | 0.0 | 146 | 2 | 178 | 2 | ... | 0 | 0 | 0 | 0 | 53476 | 1 | 60 | 0.0 | 0 | Benign |
| 11 | 5916607 | 17767 | 1491711268 | 15938 | 6 | 0.0 | 3302 | 54 | 35400 | 56 | ... | 34752 | 14480 | 11008 | 43 | 0 | 0 | 0 | 0.0 | 0 | Benign |
| 12 | 5916606 | 3613 | 1491711263 | 54195 | 6 | 0.0 | 3926 | 66 | 56022 | 68 | ... | 43440 | 14480 | 11008 | 43 | 0 | 0 | 0 | 0.0 | 0 | Benign |
| 13 | 5916603 | 7406 | 1491711266 | 2413 | 6 | 0.0 | 2854 | 46 | 29168 | 48 | ... | 28960 | 14480 | 6912 | 27 | 0 | 0 | 0 | 0.0 | 0 | Benign |
| 14 | 5916603 | 19508 | 1491711269 | 17087 | 6 | 0.0 | 4014 | 68 | 61374 | 70 | ... | 44888 | 14480 | 8960 | 35 | 0 | 0 | 0 | 0.0 | 0 | Benign |
15 rows × 45 columns
Checking for missing values in the dataset:
unsw_df.isna().sum()
IPV4_SRC_ADDR                  0
L4_SRC_PORT                    0
IPV4_DST_ADDR                  0
L4_DST_PORT                    0
PROTOCOL                       0
L7_PROTO                       0
IN_BYTES                       0
IN_PKTS                        0
OUT_BYTES                      0
OUT_PKTS                       0
TCP_FLAGS                      0
CLIENT_TCP_FLAGS               0
SERVER_TCP_FLAGS               0
FLOW_DURATION_MILLISECONDS     0
DURATION_IN                    0
DURATION_OUT                   0
MIN_TTL                        0
MAX_TTL                        0
LONGEST_FLOW_PKT               0
SHORTEST_FLOW_PKT              0
MIN_IP_PKT_LEN                 0
MAX_IP_PKT_LEN                 0
SRC_TO_DST_SECOND_BYTES        0
DST_TO_SRC_SECOND_BYTES        0
RETRANSMITTED_IN_BYTES         0
RETRANSMITTED_IN_PKTS          0
RETRANSMITTED_OUT_BYTES        0
RETRANSMITTED_OUT_PKTS         0
SRC_TO_DST_AVG_THROUGHPUT      0
DST_TO_SRC_AVG_THROUGHPUT      0
NUM_PKTS_UP_TO_128_BYTES       0
NUM_PKTS_128_TO_256_BYTES      0
NUM_PKTS_256_TO_512_BYTES      0
NUM_PKTS_512_TO_1024_BYTES     0
NUM_PKTS_1024_TO_1514_BYTES    0
TCP_WIN_MAX_IN                 0
TCP_WIN_MAX_OUT                0
ICMP_TYPE                      0
ICMP_IPV4_TYPE                 0
DNS_QUERY_ID                   0
DNS_QUERY_TYPE                 0
DNS_TTL_ANSWER                 0
FTP_COMMAND_RET_CODE           0
Label                          0
Attack                         0
dtype: int64
Checking the Label Distribution:
unsw_df.Label.value_counts()
0    24500
1     1000
Name: Label, dtype: int64
Data Visualization:
The bar graph presented below provides valuable insights into the distribution of attack types and also helps us to identify the most common attack type in the network.
plt.figure(figsize = (10,5))
ax = unsw_df.Attack.value_counts().plot(kind = "bar")
plt.title("Attack Types")
plt.xlabel("Attack Type")
plt.ylabel("Frequency")
for i, count in enumerate(unsw_df.Attack.value_counts()):
    ax.text(i, count + 1.0, str(count), ha="center", va="bottom")
plt.show()
To perform model training and evaluation, the dataset is split into training and testing subsets: 80% of the data is used for training and 20% for testing. The training set lets the models learn patterns and relationships between the features and the label, while the testing set assesses the models on unseen data, measuring their ability to predict labels.
X = unsw_df.drop(["Label"], axis = 1)
Y = unsw_df.Label
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X,Y, test_size = 0.2)
print("The Shape of unsw_df is ", unsw_df.shape)
print("The Shape of X_train is ", X_train.shape)
print("The Shape of X_test is ", X_test.shape)
print("The Shape of Y_train is ", Y_train.shape)
print("The Shape of Y_test is ", Y_test.shape)
The Shape of unsw_df is  (25500, 45)
The Shape of X_train is  (20400, 44)
The Shape of X_test is  (5100, 44)
The Shape of Y_train is  (20400,)
The Shape of Y_test is  (5100,)
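Because only a small fraction of the flows are attacks, an unstratified split can leave the test set with a class ratio that differs from the training set. A stratified split preserves the ratio in both subsets; the sketch below uses synthetic labels with a hypothetical ~4% minority class (not the notebook's actual split), relying on scikit-learn's `stratify` and `random_state` parameters:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels mimicking a ~4% minority class (hypothetical data,
# not the UNSW-NB15 flows themselves).
y = np.array([0] * 980 + [1] * 40)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# The minority fraction is preserved in both subsets.
print(y_tr.mean(), y_te.mean())
```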
Finding correlation between features
The importance of using a heatmap in NIDS lies in its ability to provide a clear picture of the interdependencies of features, reflecting potential indicators of network intrusions.
plt.figure(figsize = (32,25))
sns.heatmap(unsw_df.corr(numeric_only=True), annot=True, cmap="viridis")
plt.show()
To make the analysis easier, the categorical features are converted to numerical features using binary encoding, which can enhance the performance of the models.
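The idea behind binary encoding can be illustrated without the library: each distinct category gets an ordinal index, and that index is written out in binary digits, one column per bit. A toy sketch in plain Python (the real `BinaryEncoder` additionally handles DataFrames, unseen categories, and column naming):

```python
def binary_encode(categories):
    # Assign each distinct category a 1-based ordinal in order of first
    # appearance, then spell the ordinal out as fixed-width binary digits.
    ordinals = {}
    for c in categories:
        if c not in ordinals:
            ordinals[c] = len(ordinals) + 1
    width = max(ordinals.values()).bit_length()
    return [[int(b) for b in format(ordinals[c], f"0{width}b")] for c in categories]

print(binary_encode(["tcp", "udp", "icmp", "tcp"]))
# [[0, 1], [1, 0], [1, 1], [0, 1]]
```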
be = category_encoders.binary.BinaryEncoder()
be.fit(X_train)
X_train = be.transform(X_train)
X_test = be.transform(X_test)
print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
X_train: (20400, 47)
X_test: (5100, 47)
To ensure model stability we use feature scaling, a technique that brings all features onto a similar scale, preventing some features from dominating others.
feature_scaler = sklearn.preprocessing.StandardScaler(with_mean = False)
feature_scaler.fit(X_train)
X_train= feature_scaler.transform(X_train)
X_test = feature_scaler.transform(X_test)
print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
X_train: (20400, 47)
X_test: (5100, 47)
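With `with_mean=False`, `StandardScaler` divides each feature by its standard deviation without subtracting the mean. A quick numpy check of that formula (illustrative values, not the NIDS features):

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

# Dividing each column by its (population) standard deviation is what
# StandardScaler(with_mean=False) computes: unit variance per column,
# but the mean is left unshifted.
scaled = X / X.std(axis=0)
print(scaled.std(axis=0))   # [1. 1.]
```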
Checking for class imbalance
label_distribution = Y_train.value_counts()
colours = ["red","blue"]
plt.bar(label_distribution.index,label_distribution.values, color = colours)
plt.xlabel("Label")
plt.ylabel("Frequency")
plt.xticks([0,1],["0", "1"])
plt.title("Label Distribution")
for i, count in enumerate(label_distribution):
    plt.text(i, count + 1.0, str(count), ha="center", va="bottom")
The plotted graph shows a significant class imbalance in the training data, which may bias the models towards the majority class. To avoid this, the classes are balanced using the SMOTE technique.
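At its core, SMOTE creates each synthetic minority sample by interpolating between an existing minority sample and one of its nearest minority-class neighbours. A simplified numpy sketch of that interpolation step (imblearn's implementation additionally performs the k-nearest-neighbour search):

```python
import numpy as np

def synthesize(x, neighbor, rng):
    # The synthetic sample lies at a random point on the line segment
    # between a minority sample and one of its neighbours.
    return x + rng.random() * (neighbor - x)

rng = np.random.default_rng(42)
x = np.array([1.0, 2.0])
neighbor = np.array([3.0, 6.0])
sample = synthesize(x, neighbor, rng)
print(sample)   # componentwise between x and neighbor
```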
over_sampling_sm = imblearn.over_sampling.SMOTE()
X_train,Y_train = over_sampling_sm.fit_resample(X_train,Y_train)
Y_train.value_counts()
0    19589
1    19589
Name: Label, dtype: int64
To assess the performance and effectiveness of each model, a comprehensive evaluation is used to measure its accuracy and reliability. The models selected for training are Random Forest, XGBoost and the MLP classifier, and hyperparameter tuning is performed to enhance their performance. The evaluation incorporates the accuracy score, the classification report and ROC curve analysis. The accuracy score indicates the proportion of correctly classified instances; the classification report provides detailed metrics such as precision, recall and F1-score, showing each model's ability to balance precision and recall per class; and the ROC curve demonstrates a model's ability to discriminate between the positive and negative classes.
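These metrics all derive from confusion-matrix counts; as a quick reference, with hypothetical counts (illustrative only, not results from this dataset):

```python
# Hypothetical confusion-matrix counts (not from this dataset).
tp, fp, fn, tn = 180, 2, 9, 4909

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # fraction classified correctly
precision = tp / (tp + fp)                    # of predicted attacks, how many are real
recall    = tp / (tp + fn)                    # of real attacks, how many are caught
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```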
Random Forest can handle high-dimensional data and deals effectively with overfitting. It uses an ensemble of decision trees and incorporates randomness into the feature selection, resulting in accurate predictions.
random_forest = RandomForestClassifier()
random_forest_params = {
    "n_estimators": [10, 50, 100],
    "max_depth": [None, 3, 5],
}
random_forest_grid_search = GridSearchCV(random_forest, random_forest_params, cv = 5)
rf = random_forest_grid_search.fit(X_train,Y_train )
random_forest_predicted = random_forest_grid_search.predict(X_test)
random_forest_accuracy_score = sklearn.metrics.accuracy_score(Y_test,random_forest_predicted)
report_random_forest = classification_report(Y_test,random_forest_predicted)
print(report_random_forest)
print("The accuracy score of Random Forest algorithm is ",random_forest_accuracy_score*100,"%")
print('Random Forest best parameters:', random_forest_grid_search.best_params_)
precision recall f1-score support
0 1.00 1.00 1.00 4911
1 0.99 1.00 0.99 189
accuracy 1.00 5100
macro avg 0.99 1.00 1.00 5100
weighted avg 1.00 1.00 1.00 5100
The accuracy score of Random Forest algorithm is 99.9607843137255 %
Random Forest best parameters: {'max_depth': None, 'n_estimators': 100}
ROC CURVE FOR RANDOM FOREST
from sklearn.metrics import roc_curve, roc_auc_score
rf_prob = rf.predict_proba(X_test)[:,1]
fpr,tpr,thresholds = roc_curve(Y_test,rf_prob )
auc_rf = roc_auc_score(Y_test,rf_prob)
train_accuracy = rf.score(X_train, Y_train)
test_accuracy = rf.score(X_test, Y_test)
plt.plot(fpr, tpr, label = "Random Forest (Training Accuracy: {:.2f}, Testing Accuracy: {:.2f})".format(train_accuracy, test_accuracy))
plt.plot([0,1], [0,1], linestyle = "--", color = "r", label = "Reference line")
plt.xlabel("FALSE POSITIVE RATE")
plt.ylabel("TRUE POSITIVE RATE")
plt.title("ROC CURVE")
plt.legend(loc = "lower right")
plt.show()
print ("The AUC score of Random forest Classifier is ",auc_rf )
The AUC score of Random forest Classifier is 0.9999983839324096
XGBoost is a powerful gradient boosting algorithm that combines multiple weak learners to create a strong predictive model. It also has built-in regularization techniques to prevent overfitting.
xgb = XGBClassifier()
xgb_params = {
    "n_estimators": [50, 100, 150],
    "max_depth": [3, 6, 9],
    "learning_rate": [0.1, 0.01, 0.001],
}
xgb_grid_search = GridSearchCV(xgb, xgb_params, cv = 5)
xgb_classifier = xgb_grid_search.fit(X_train,Y_train )
xgb_predicted = xgb_grid_search.predict(X_test)
xgb_accuracy_score = sklearn.metrics.accuracy_score(Y_test,xgb_predicted)
report_xgb = classification_report(Y_test,xgb_predicted)
print(report_xgb)
print("The accuracy score of XGBoost algorithm is ",xgb_accuracy_score*100,"%" )
print('XG Boost best parameters:', xgb_grid_search.best_params_)
precision recall f1-score support
0 1.00 1.00 1.00 4911
1 1.00 1.00 1.00 189
accuracy 1.00 5100
macro avg 1.00 1.00 1.00 5100
weighted avg 1.00 1.00 1.00 5100
The accuracy score of XGBoost algorithm is 100.0 %
XG Boost best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 100}
ROC CURVE FOR XGB CLASSIFIER
from sklearn.metrics import roc_curve, roc_auc_score
xgb_prob = xgb_classifier.predict_proba(X_test)[:,1]
fpr,tpr,thresholds = roc_curve(Y_test,xgb_prob )
auc_xgb = roc_auc_score(Y_test,xgb_prob)
train_accuracy = xgb_classifier.score(X_train, Y_train)
test_accuracy = xgb_classifier.score(X_test, Y_test)
plt.plot(fpr, tpr, label = "XGB (Training Accuracy: {:.2f}, Testing Accuracy: {:.2f})".format(train_accuracy, test_accuracy))
plt.plot([0,1], [0,1], linestyle = "--", color = "r", label = "Reference line")
plt.xlabel("FALSE POSITIVE RATE")
plt.ylabel("TRUE POSITIVE RATE")
plt.title("ROC CURVE")
plt.legend()
plt.show()
print ("The AUC score of XGB Classifier is ",auc_xgb )
The AUC score of XGB Classifier is 1.0
An MLP can handle complex non-linear relationships between features. The dataset has a wide range of features extracted from network traffic, and the MLP's ability to model and learn intricate relationships makes it a suitable choice for this task.
mlp = MLPClassifier()
mlp_params = {
    "hidden_layer_sizes": [(10,), (50,), (100,)],
    "activation": ["relu", "tanh"],
}
mlp_grid_search = GridSearchCV(mlp, mlp_params, cv = 5)
mlp_classifier = mlp_grid_search.fit(X_train,Y_train )
mlp_predicted = mlp_grid_search.predict(X_test)
mlp_accuracy_score = sklearn.metrics.accuracy_score(Y_test,mlp_predicted)
report_mlp = classification_report(Y_test,mlp_predicted)
print(report_mlp)
print("The accuracy score of MLP classifier is ",mlp_accuracy_score*100,"%")
print('MLP best parameters:', mlp_grid_search.best_params_)
precision recall f1-score support
0 1.00 1.00 1.00 4911
1 1.00 1.00 1.00 189
accuracy 1.00 5100
macro avg 1.00 1.00 1.00 5100
weighted avg 1.00 1.00 1.00 5100
The accuracy score of MLP classifier is 100.0 %
MLP best parameters: {'activation': 'relu', 'hidden_layer_sizes': (10,)}
ROC CURVE FOR MLP CLASSIFIER
from sklearn.metrics import roc_curve, roc_auc_score
mlp_prob = mlp_classifier.predict_proba(X_test)[:,1]
fpr,tpr,thresholds = roc_curve(Y_test,mlp_prob )
auc_mlp = roc_auc_score(Y_test,mlp_prob)
train_accuracy = mlp_classifier.score(X_train, Y_train)
test_accuracy = mlp_classifier.score(X_test, Y_test)
plt.plot(fpr, tpr, label = "MLP (Training Accuracy: {:.2f}, Testing Accuracy: {:.2f})".format(train_accuracy, test_accuracy))
plt.plot([0,1], [0,1], linestyle = "--", color = "r", label = "Reference line")
plt.xlabel("FALSE POSITIVE RATE")
plt.ylabel("TRUE POSITIVE RATE")
plt.title("ROC CURVE")
plt.legend()
plt.show()
print ("The AUC score of MLP Classifier is ",auc_mlp )
The AUC score of MLP Classifier is 1.0
The ROC curve analysis of all three algorithms yielded near-perfect performance, with AUC scores at or near 1.0, indicating essentially perfect separation between the two classes on this test set.
LSTM networks are known for their ability to capture long-term dependencies and are widely used in sequential data analysis. However, the model tended to overfit this dataset and achieved lower accuracy than the three models used here; there was a significant gap between training accuracy and validation accuracy, indicating overfitting.
The scatter plots help us visualize how well each model's predictions align with the actual values. The plots below indicate a high level of agreement between the predicted and actual values.
fig = make_subplots(rows=3, cols=1)
fig.add_trace(go.Scatter(x = list(range(len(Y_test))), y = Y_test, mode = "markers", name = "Actual Value"), row = 1,col=1)
fig.add_trace(go.Scatter(x = list(range(len(Y_test))), y = random_forest_predicted, mode = "lines", name = "Random Forest Predicted"), row = 1,col=1)
fig.add_trace(go.Scatter(x = list(range(len(Y_test))), y = Y_test, mode = "markers", name = "Actual Value"), row = 2,col=1)
fig.add_trace(go.Scatter(x = list(range(len(Y_test))), y = xgb_predicted, mode = "lines", name = "XGB Predicted"), row = 2,col=1)
fig.add_trace(go.Scatter(x = list(range(len(Y_test))), y = Y_test, mode = "markers", name = "Actual Value"), row = 3,col=1)
fig.add_trace(go.Scatter(x = list(range(len(Y_test))), y = mlp_predicted, mode = "lines", name = "MLP Predicted"), row = 3,col=1)
fig.update_layout(height = 800, width = 800, title ="Actual vs Predicted values")
fig.update_xaxes(title_text = "No Of Data Points")
fig.update_yaxes(title_text = "Label", row =1, col =1)
fig.update_yaxes(title_text = "Label", row =2, col =1)
fig.update_yaxes(title_text = "Label", row =3, col =1)
fig.show()
The function "initiate_warning" is responsible for generating a warning and blocking the source IP address when a network intrusion is detected.
def initiate_warning(ip):
    print(f"WARNING! Network intrusion detected from the source IP address {ip}")
    os.system(f"sudo iptables -A INPUT -s {ip} -j DROP")
    print(f"The IP {ip} has been BLOCKED!")
IPV4_SRC_ADDR = "175.451.760"
initiate_warning(IPV4_SRC_ADDR)
WARNING! Network intrusion detected from the source IP address 175.451.760 The IP 175.451.760 has been BLOCKED!
After a comprehensive analysis of the three algorithms, it is evident that each showcased remarkable performance in the context of NIDS. The accuracy and precision of MLP and XGBoost are slightly higher than those of Random Forest, which, while performing exceptionally well, exhibited slightly lower precision for intrusive activities. MLP and XGBoost achieved perfect precision, recall and F1-score, indicating that they can accurately classify intrusive and non-intrusive activities. Considering the performance of all three algorithms, each has unique advantages for NIDS tasks, and all performed remarkably well on the dataset, offering valuable insights for effective intrusion detection and network security.
Even though the Random Forest, XGBoost and Multilayer Perceptron (MLP) algorithms offer promising results for NIDS, they are not immune to limitations. The limitations include:
The Network Intrusion Detection System developed for B-76 Technologies has displayed favourable results for accuracy, precision and recall. The successful implementation of this system gives the company an essential tool for protecting its network from unauthorized access and malicious activities: it can detect anomalies and alert the administrators to potential security breaches. Thus, the NIDS improves the company's overall security posture by accurately classifying network traffic as normal or malicious. Although NIDS are not infallible and have limitations in detecting advanced attacks, integrating them with additional security measures and continuously fine-tuning them can increase their effectiveness. In conclusion, the NIDS enables the company to identify and mitigate potential threats, ultimately bolstering its network security infrastructure.
research.unsw.edu.au. (n.d.). The UNSW-NB15 Dataset | UNSW Research. [online] Available at: https://research.unsw.edu.au/projects/unsw-nb15-dataset.
Ahmad, M., Riaz, Q., Zeeshan, M., Tahir, H., Haider, S.A. and Khan, M.S. (2021). Intrusion detection in internet of things using supervised machine learning based on application and transport layer features using UNSW-NB15 data-set. EURASIP Journal on Wireless Communications and Networking, 2021(1). doi:https://doi.org/10.1186/s13638-021-01893-8.
Subrata Maji (2020). Building an Intrusion Detection System on UNSW-NB15 Dataset Based on Machine Learning Algorithm. [online] Medium. Available at: https://medium.com/@subrata.maji16/building-an-intrusion-detection-system-on-unsw-nb15-dataset-based-on-machine-learning-algorithm-16b1600996f5.
Moustafa, N. and Slay, J. (2015). UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set). [online] IEEE Xplore. doi:https://doi.org/10.1109/MilCIS.2015.7348942.
M.S. et al. (2019). Network Based Intrusion Detection Using the UNSW-NB15 Dataset. International Journal of Computing and Digital Systems, [online] 8(5), p.477. Available at: https://www.academia.edu/40842534/Network_Based_Intrusion_Detection_Using_the_UNSW_NB15_Dataset [Accessed 26 Jun. 2023].